Star Ratings Cutpoints using Clustering

This tutorial will guide you through how CMS calculates the cutpoints used in the Star Ratings program.

CMS publishes the data related to their Star Ratings on their Part C and D Performance Data webpage. I have downloaded the 2022 Star Ratings Data Table and unzipped it to a directory on my computer.

Now, I need to read the Excel file into a dataframe. I also use a converter to strip the trailing spaces from the contract IDs, and I rename some of the columns.
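A minimal sketch of this step, assuming pandas. The column name CONTRACT_ID, the sheet, and the header row are placeholders rather than the actual layout of the CMS workbook, so check them against the file you downloaded.

```python
import pandas as pd


def strip_contract_id(value):
    """Converter: remove surrounding whitespace from a contract ID."""
    return str(value).strip()


def load_measure_data(path, sheet_name=0, header=0):
    """Read one worksheet into a dataframe.

    The column name "CONTRACT_ID" is an assumption for illustration;
    use whatever the workbook actually calls the contract ID column.
    """
    df = pd.read_excel(
        path,
        sheet_name=sheet_name,
        header=header,
        # Apply the converter to every value in the contract ID column.
        converters={"CONTRACT_ID": strip_contract_id},
    )
    # Rename to a shorter, lowercase name for later steps.
    return df.rename(columns={"CONTRACT_ID": "contract_id"})
```

The converter runs on each cell while the file is being read, so the IDs are already clean by the time the dataframe exists.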

I create a new column for the contract type. The contract type is determined by looking at the first character of the contract ID.
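Something like the following would do it. The prefix-to-type mapping here is a simplified assumption for illustration only, and the column names are placeholders.

```python
import pandas as pd

# Assumed, simplified mapping from the first character of the contract ID
# to a contract type. Adjust it to the types you want to analyze.
CONTRACT_TYPES = {"H": "MAPD", "R": "MAPD", "S": "PDP"}

df = pd.DataFrame({"contract_id": ["H1234", "R5678", "S9012"]})

# Take the first character of each ID and look it up in the mapping.
# Prefixes missing from the mapping become NaN.
df["contract_type"] = df["contract_id"].str[0].map(CONTRACT_TYPES)
print(df)
```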

Next, I need to transform and clean up the data a bit. I use the melt() function to reorient the data and create a column for the measure name and rate.
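Here is a small sketch of the reshaping on made-up data. The measure names and values are invented; only the melt() call itself reflects the step described above.

```python
import pandas as pd

# Wide layout: one row per contract, one column per measure (made-up values).
wide = pd.DataFrame({
    "contract_id": ["H0001", "H0002"],
    "contract_type": ["MAPD", "MAPD"],
    "Measure A": ["85%", "Not enough data available"],
    "Measure B": ["0.12", "0.30"],
})

# Long layout: one row per contract and measure.
long = wide.melt(
    id_vars=["contract_id", "contract_type"],
    var_name="measure",
    value_name="rate",
)
print(long)
```

The long layout makes it easy to filter to a single measure or to group by measure later on.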

Then, I apply a function that attempts to convert the rates to numbers and drops contracts without a rate.
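One way to write such a function is sketched below. I am assuming that rates may arrive as plain numbers, as strings with a percent sign, or as free text for contracts without a rate; the helper name to_rate and the sample values are hypothetical.

```python
import pandas as pd


def to_rate(value):
    """Return the rate as a float, or NaN if the value is not numeric."""
    try:
        return float(str(value).strip().rstrip("%"))
    except ValueError:
        return float("nan")


long = pd.DataFrame({
    "contract_id": ["H0001", "H0002", "H0003"],
    "rate": ["85%", "Not enough data available", "0.12"],
})

# Convert every rate, then drop the contracts that did not yield a number.
long["rate"] = long["rate"].apply(to_rate)
long = long.dropna(subset=["rate"]).reset_index(drop=True)
print(long)
```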

Take a look at the cleaned-up data.

Now, the real work begins.

CMS uses mean resampling to determine the cutpoints. They create 10 equal-sized groups and then apply the clustering algorithm 10 times, leaving one group out each time.

For the purpose of this tutorial, I have limited the analysis to MAPD contracts and the Medication Adherence for Cholesterol (Statins) measure.

I use scikit-learn's KFold to create the 10 groups.

I then loop over the groups and use the KMeans algorithm to create the 5 clusters. In each iteration, I fit the model to the sample's data (the contracts that remain after one group is left out) and assign each contract in the sample to a cluster.
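The loop can be sketched as follows, assuming scikit-learn and pandas. The rates are randomly generated stand-ins for one measure, and the seeds and n_init value are my own choices for reproducibility, not settings taken from CMS.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

# Made-up rates for 200 contracts on a single measure.
rng = np.random.default_rng(0)
rates = pd.DataFrame({
    "contract_id": [f"H{i:04d}" for i in range(200)],
    "rate": rng.uniform(60, 95, size=200).round(1),
})

# Split the contracts into 10 groups; each iteration leaves one group out.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

samples = []
for group, (keep_idx, _left_out_idx) in enumerate(kfold.split(rates)):
    sample = rates.iloc[keep_idx].copy()
    model = KMeans(n_clusters=5, n_init=10, random_state=0)
    # Fit on the sample and assign each contract to one of the 5 clusters.
    sample["cluster"] = model.fit_predict(sample[["rate"]])
    sample["group"] = group
    samples.append(sample)

clustered = pd.concat(samples, ignore_index=True)
print(clustered.head())
```

KFold.split() yields the row positions to keep and the row positions left out, so only the first array is needed here.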

Since higher rates reflect better performance, I use the minimum rate in each cluster as the cutpoint. The clusters are not ordered, so I sort them by their minimum rate and assign each a label of 1 to 5 stars.
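A small sketch of the ordering step on hand-made data for a single group. The column names group, cluster, and rate are assumptions about how the clustered sample is laid out.

```python
import pandas as pd

# One leave-one-out sample with arbitrary cluster labels (made-up values).
clustered = pd.DataFrame({
    "group": [0] * 10,
    "cluster": [3, 3, 0, 0, 4, 4, 1, 1, 2, 2],
    "rate": [62, 65, 70, 72, 78, 80, 84, 86, 90, 93],
})

# The cutpoint for a cluster is its minimum rate.
cutpoints = (
    clustered.groupby(["group", "cluster"], as_index=False)["rate"]
    .min()
    .rename(columns={"rate": "cutpoint"})
)

# Rank the cutpoints within each group: lowest gets 1 star, highest gets 5.
cutpoints["stars"] = (
    cutpoints.groupby("group")["cutpoint"].rank(method="first").astype(int)
)
print(cutpoints.sort_values("stars"))
```

For a measure where lower rates are better, the same idea would use the maximum rate and a descending rank instead.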

Using plotly, I plot the data in a scatter plot to view all of the points.

To create the final cutpoints, I use the minimum rate across all of the groups for each cluster.
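As a sketch, with two made-up groups standing in for the ten, and assuming one row per group and star level:

```python
import pandas as pd

# Per-group cutpoints for each star level (made-up values, two groups only).
cutpoints = pd.DataFrame({
    "group": [0] * 5 + [1] * 5,
    "stars": [1, 2, 3, 4, 5] * 2,
    "cutpoint": [60, 71, 79, 85, 90, 61, 70, 80, 84, 91],
})

# Final cutpoint per star level: the minimum across all groups.
final = cutpoints.groupby("stars")["cutpoint"].min()
print(final)
```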

Take a look at the output. It may not exactly match CMS's results for two reasons: CMS removes some contracts that were impacted by a natural disaster, and the sampling is random, so it may not yield the same groups.

I can now put this all together and loop over several measures and contract types. I use pivot() to reorient the data.
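The reshaping at the end might look like this. The results frame stands in for whatever the loop collects; its column names and values are invented for illustration.

```python
import pandas as pd

# One row per contract type, measure, and star level (made-up values).
results = pd.DataFrame({
    "contract_type": ["MAPD"] * 4,
    "measure": ["Measure A"] * 2 + ["Measure B"] * 2,
    "stars": [2, 3, 2, 3],
    "cutpoint": [70, 79, 0.2, 0.4],
})

# One row per contract type and measure, one column per star level.
table = results.pivot(
    index=["contract_type", "measure"],
    columns="stars",
    values="cutpoint",
)
print(table)
```

Note that pivot() raises an error if a contract type, measure, and star level combination appears more than once, which is a useful check that the loop produced exactly one cutpoint per combination.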